Friday, 4 December 2020

Tokenization in Natural Language Processing (Making machines understand us)


    

Hi, and welcome to natural language processing with TensorFlow. If you're not an expert on AI or ML, don't worry: we're taking the concepts of NLP and teaching them from first principles. Here, we'll talk about how to represent words in a way that a computer can process, with a view to later training a neural network that can understand their meaning. This process is called tokenization.


So let's take a look. 


Consider the word "listen." It's made up of a sequence of letters, and each letter can be represented by a number using an encoding scheme. A popular scheme called ASCII maps every letter to a number, so a sequence of ASCII codes can represent the word "listen." But the word "silent" has exactly the same letters, and thus the same numbers, just in a different order. That makes it hard to understand the sentiment of a word just from the letters in it. So it might be easier, instead of encoding letters, to encode words.
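For example, here's a minimal sketch using Python's built-in ord() to get the ASCII code for each letter:

listen = [ord(c) for c in "listen"]   # [108, 105, 115, 116, 101, 110]
silent = [ord(c) for c in "silent"]   # [115, 105, 108, 101, 110, 116]

print(sorted(listen) == sorted(silent))   # True: same letters, different order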


Consider the sentence "I love my dog." What would happen if we started encoding the words in this sentence instead of the letters in each word? For example, the word "I" could be 1, and then the sentence "I love my dog" could be 1, 2, 3, 4. Now, if I take another sentence, say "I love my cat," how would we encode it? We see that "I love my" has already been given 1, 2, 3, so all we need to do is encode "cat." I'll give that the number 5. And now, if we look at the two sentences, they are 1, 2, 3, 4 and 1, 2, 3, 5, which already shows some similarity between them. And it's a similarity you would expect, because they're both about loving a pet.
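To make that concrete, here's a quick sketch of the same idea in plain Python; the dictionary is just hand-built for illustration, not the tokenizer we'll use below:

# A hand-built word-to-number mapping
word_index = {'I': 1, 'love': 2, 'my': 3, 'dog': 4, 'cat': 5}

print([word_index[w] for w in 'I love my dog'.split()])   # [1, 2, 3, 4]
print([word_index[w] for w in 'I love my cat'.split()])   # [1, 2, 3, 5]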


Given this method of encoding sentences into numbers, now let's take a look at some code to achieve this for us. This process, as I mentioned before, is called tokenization, and there's an API for that. We'll look at how to use it with Python. 



import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.preprocessing.text import Tokenizer

sentences = [
    'I love my dog',
    'I love my cat'
]

# Keep the 100 most frequent words
tokenizer = Tokenizer(num_words=100)

# Build the word index from the sentences
tokenizer.fit_on_texts(sentences)

word_index = tokenizer.word_index
print(word_index)


Output:

{'i': 1, 'love': 2, 'my': 3, 'dog': 4, 'cat': 5}


    

So here's your first look at some code to tokenize these sentences. Let's go through it line by line. First of all, we need the Tokenizer API, which we can get from TensorFlow Keras. We can represent our sentences as a Python list of strings like this: it's simply the "I love my dog" and "I love my cat" that we saw earlier.


Now the fun begins. I can create an instance of a Tokenizer object. The num_words parameter is the maximum number of words to keep. So instead of just these two sentences, imagine we had hundreds of books to tokenize but wanted only the most frequent 100 words across all of them. Setting num_words takes care of that cap for us once the tokenizer has been fitted. And fitting is the next step: we tell the tokenizer to go through all the text and fit itself to it, like this.
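As a side note, the fitted tokenizer also keeps the raw frequency counts it bases this on, in its word_counts property; here's a minimal sketch, with the output I'd expect for our two sentences shown as a comment:

# How often each word appeared across all the sentences
print(tokenizer.word_counts)
# OrderedDict([('i', 2), ('love', 2), ('my', 2), ('dog', 1), ('cat', 1)])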


The full list of words is available as the tokenizer's word_index property. So we can take a look at it like this and simply print it out. The result is a dictionary where the key is the word and the value is the token for that word. So, for example, "my" has a value of 3.


The tokenizer is also smart enough to handle some special cases. For example, if we updated our sentences by adding a third one, "You love my dog!", note that "dog" here is followed by an exclamation mark. The nice thing is that the tokenizer is smart enough to spot this and not create a new token: it strips the punctuation, so it's just "dog." And you can see the results here. There's no token for "dog!", but there is one for "dog", and there's also a new token for the word "you."
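Here's that example as code, continuing from the snippet above; the dictionary in the comment is the output I'd expect, given that the tokenizer strips punctuation by default and orders words by frequency:

sentences = [
    'I love my dog',
    'I love my cat',
    'You love my dog!'
]

tokenizer = Tokenizer(num_words=100)
tokenizer.fit_on_texts(sentences)

# "dog!" is folded into "dog"; "you" gets a new token
print(tokenizer.word_index)
# {'love': 1, 'my': 2, 'i': 3, 'dog': 4, 'cat': 5, 'you': 6}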


You've now seen how words can be tokenized and the tools in TensorFlow that handle that tokenization for you. Now that your words are represented by numbers like this, you'll next need to represent your sentences as sequences of numbers in the correct order. You'll then have data ready for processing by a neural network to understand or maybe even generate new text. You'll see the tools you can use to manage this sequencing in the next blog post.

